Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR
نویسندگان
چکیده
It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
منابع مشابه
Finding the Better Indexing units for Chinese Information Retrieval
In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams had been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performances. In this study, we carried out more experiments to find the better way to index Chinese texts. First, we investi...
متن کاملEvaluation via Negativa of Chinese Word Segmentation for Information Retrieval
Numerous studies have analyzed the influences of word segmentation (WS) performance on information retrieval (IR) for Mandarin Chinese and have demonstrated a non-monotonic relationship between WS accuracy and IR effectiveness. The usefulness of the compound words that have been a focus of the IR literature is not reflected by common WS evaluation metrics of word-based precision (P) and recall ...
متن کاملApplication of the Tightness Continuum Measure to Chinese Information Retrieval
Most word segmentation methods employed in Chinese Information Retrieval systems are based on a static dictionary or a model trained against a manually segmented corpus. These general segmentation approaches may not be optimal because they disregard information within semantic units. We propose a novel method for improving word-based Chinese IR, which performs segmentation according to the tigh...
متن کاملEnglish-Chinese Cross-Language IR Using Bilingual Dictionaries
This report describes the English-Chinese crosslanguage experiments at Berkeley for TREC-9 CrossLanguage Information Retrieval track. We present a simple and effective Chinese word segmentation method and compare the cross-language retrieval performance of two bilingual dictionaries for query translation.
متن کاملعملکرد حافظهی سرگذشتی و مسئلهگشایی در پیوستار زندگی ـ مرگ : پژوهشی در بیماران افسرده
Abstract Introduction: This study aimed to determine the relationship between generality in retrieval from autobiographical memory in depressed patients and functional deficit in problem-solving strategies. Method: This survey analyzed the findings of several previous studies that investigated the subject of retrieval from autobiographical memory and the process of problem-solving among four g...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2002